4a) Errors and power — torturing the t-test

Data Simulation with Monte Carlo Methods

Marko Bachl

University of Hohenheim

Errors and power — torturing the t-test

Traktandenliste

  1. Introduction & overview

  2. Monte Carlo Simulation?

  3. Proof by simulation — the Central Limit Theorem (CLT)

  4. Errors and power — torturing the t-test

  5. Misclassification and bias — Messages mismeasured

  6. Outlook: What’s next?

4. Errors and power — torturing the t-test

  1. Comparison of Student’s and Welch’s t-tests

  2. A priori power calculation for Welch’s t-test

Student’s t-test & Welch’s t-test

Student’s t-test for two independent means

\(t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt\frac{2}{n}}\), where \(s_p = \sqrt{\frac{s_{X_1}^2+s_{X_2}^2}{2}}\) and \(df = n_1 + n_2 − 2\)

Welch’s t-test for two independent means

\(t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar\Delta}}\), where \({\bar\Delta} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\) and \(df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1} + \frac{\left(s_2^2/n_2\right)^2}{n_2-1}}\)

Student’s t-test & Welch’s t-test

  • Old school advice: Student’s t-test for equal variances, Welch’s t-test for unequal variances.

    • Higher power of Student’s t-test if assumptions hold.
  • Modern advice: Always use Welch’s t-test.

    • Better if assumptions are violated; not worse if assumptions hold.

    • e.g., Delacre, M., Lakens, D., & Leys, C. (2017). Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1). https://doi.org/10.5334/irsp.82

  • For those who don’t care about t-tests: Idea also applies to heteroskedasticity-consistent standard errors.

First Simulation study: False discoveries

  1. Question: What is the goal of the simulation?

    • Comparison of Student’s and Welch’s t-tests
  2. Quantities of interest: What is measured in the simulation?

    • \(\alpha\) error rate of the tests
  3. Evaluation strategy: How are the quantities assessed?

    • Comparison of empirical distribution of p-values with theoretical distribution under the Null hypothesis
  4. Conditions: Which characteristics of the data generating model will be varied?

    • Group size ratio; Standard deviation ratio
  5. Data generating model: How are the data simulated?

    • Random draws from normal distributions with equal means, but different group sizes and standard deviations

Practical considerations

  • How does the t.test() work in R?

  • Which parameters and functions do we need for generating the data?

  • How do we implement the experimental design for the simulation study?

  • How do we run the simulation?

  • How do we assess and visualize the results?

One simulation: Parameters

  • Two factors:

    • Group sizes: Equal or unequal?

    • Group standard deviations: Equal or unequal?

  • Implementation with ratios and fixed total sample size.

set.seed(28)
n = 150 # fixed total sample size
GR = 2 # n2/n1
n1 = round(n / (GR + 1))
n2 = round(n1 * GR)
sd1 = 1
SDR = 2  # sd2/sd1
sd2 = sd1 * SDR

One simulation: t-tests

g1 = rnorm(n = n1, mean = 0, sd = sd1)
g2 = rnorm(n = n2, mean = 0, sd = sd2)
t.test(g1, g2) # Welch (default)
t.test(g1, g2, var.equal = TRUE) # Student

    Welch Two Sample t-test

data:  g1 and g2
t = 0.74022, df = 147.94, p-value = 0.4603
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3198803  0.7030466
sample estimates:
  mean of x   mean of y 
-0.09838857 -0.28997168 

    Two Sample t-test

data:  g1 and g2
t = 0.60009, df = 148, p-value = 0.5494
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4393035  0.8224698
sample estimates:
  mean of x   mean of y 
-0.09838857 -0.28997168 

Inspect the output of t.test()

t.test(g1, g2) %>% str()
List of 10
 $ statistic  : Named num 0.74
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 148
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.46
 $ conf.int   : num [1:2] -0.32 0.703
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] -0.0984 -0.29
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.259
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "g1 and g2"
 - attr(*, "class")= chr "htest"
t.test(g1, g2)$p.value
[1] 0.4603425

Wrap it into a function

sim_ttest = function(GR = 1, SDR = 1, n = 150) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = 0, sd = sd2)
  welch = t.test(g1, g2)$p.value
  student = t.test(g1, g2, var.equal = TRUE)$p.value
  res = list("Welch" = welch, "Student" = student)
  return(res)
}

Call the function once

sim_ttest(GR = 2, SDR = 0.5) # Implies n1 = 50, n2 = 100, sd1 = 1, sd2 = 0.5
$Welch
[1] 0.5069771

$Student
[1] 0.4110377

Setup of experimental conditions

expand_grid(a = 1:2, b = 3:4)
# A tibble: 4 × 2
      a     b
  <int> <int>
1     1     3
2     1     4
3     2     3
4     2     4

Setup of experimental conditions

  • Two between-simulation factors: GR and SDR
conditions = expand_grid(GR = c(1, 2),
                         SDR = c(0.5, 1, 2)) %>% 
  rowid_to_column(var = "condition")
conditions
# A tibble: 6 × 3
  condition    GR   SDR
      <int> <dbl> <dbl>
1         1     1   0.5
2         2     1   1  
3         3     1   2  
4         4     2   0.5
5         5     2   1  
6         6     2   2  

Setup of experimental conditions

condition GR SDR implies
1 1 0.5 n1 = 75, n2 = 75, sd1 = 1, sd2 = 0.5
2 1 1.0 n1 = 75, n2 = 75, sd1 = 1, sd2 = 1
3 1 2.0 n1 = 75, n2 = 75, sd1 = 1, sd2 = 2
4 2 0.5 n1 = 50, n2 = 100, sd1 = 1, sd2 = 0.5
5 2 1.0 n1 = 50, n2 = 100, sd1 = 1, sd2 = 1
6 2 2.0 n1 = 50, n2 = 100, sd1 = 1, sd2 = 2

Run simulation experiment

set.seed(689)
i = 10000 # number of sim runs per condition
tic() # simple way to measure wall time
sims = map_dfr(1:i, ~ conditions) %>% # each condition i times
  rowid_to_column(var = "sim") %>% # within simulation comparison
  rowwise() %>%
  mutate(p.value = list(sim_ttest(GR = GR, SDR = SDR))) %>% 
  unnest_longer(p.value, indices_to = "method")
toc() # simple way to measure wall time
14.832 sec elapsed

Inspect simulated data

# A tibble: 120,000 × 6
     sim condition    GR   SDR p.value method 
   <int>     <int> <dbl> <dbl>   <dbl> <chr>  
 1     1         1     1   0.5 0.345   Welch  
 2     1         1     1   0.5 0.344   Student
 3     2         2     1   1   0.534   Welch  
 4     2         2     1   1   0.534   Student
 5     3         3     1   2   0.647   Welch  
 6     3         3     1   2   0.647   Student
 7     4         4     2   0.5 0.00985 Welch  
 8     4         4     2   0.5 0.00224 Student
 9     5         5     2   1   0.602   Welch  
10     5         5     2   1   0.614   Student
# … with 119,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Results: Histogram of p-values

  • Distribution of p-values under the Null hypothesis should be \(\sf{Uniform}(0, 1)\).
sims %>% 
  ggplot(aes(p.value)) +
  geom_histogram(binwidth = 0.05, boundary = 0) + 
  facet_grid(method ~ GR + SDR, labeller = label_both) + 
  scale_x_continuous(breaks = c(1, 3)/4)

Results: Histogram of p-values

Results: Q-Q-plot of p-values

  • Distribution of p-values under the Null hypothesis should be \(\sf{Uniform}(0, 1)\).
sims %>% 
  ggplot(aes(sample = p.value, color = method)) +
  geom_qq(distribution = stats::qunif, size = 0.2) + 
  geom_abline(slope = 1) + 
  facet_grid(GR ~ SDR, labeller = label_both)

Results: Q-Q-plot of p-values

Results: Looking at the relevant tail

sims %>% 
  group_by(GR, SDR, method) %>% 
  summarise(P_p001 = mean(p.value <= 0.001),
            P_p01 = mean(p.value <= 0.01), 
            P_p05 = mean(p.value <= 0.05)) %>% 
  kable(digits = 3)
GR SDR method P_p001 P_p01 P_p05
1 0.5 Student 0.001 0.010 0.049
1 0.5 Welch 0.001 0.010 0.048
1 1.0 Student 0.000 0.010 0.048
1 1.0 Welch 0.000 0.010 0.048
1 2.0 Student 0.001 0.011 0.047
1 2.0 Welch 0.001 0.010 0.047
2 0.5 Student 0.008 0.034 0.107
2 0.5 Welch 0.002 0.010 0.049
2 1.0 Student 0.001 0.010 0.051
2 1.0 Welch 0.000 0.010 0.052
2 2.0 Student 0.000 0.002 0.016
2 2.0 Welch 0.001 0.009 0.052

Questions?

Group exercise 2

Exercise 2a.

  • Investigate how the amount of disparity in group sizes and standard deviations changes the results.

    • When do the differences between the methods become neglegible?

    • When does the false discovery rate of Student’s t-test become really bad?

  • Adapt the experimental setup accordingly.

    • Change or expand the levels of the GR and SDR factors.

    • Consider changing the experimental design to make the simulation more efficient.

Solution 2a

  • TODO